Putting Visual Analytics into Practical Use.
We are tasked with creating a data visualisation that segments the "kids drinks and other" category by nutritional indicators. The starbucks_drink.csv dataset is used for this task.
Since we are segmenting kids’ drinks by nutritional indicators, we will focus on the nutritional values and build a data visualisation that tells a story: are kids’ drinks at Starbucks healthy, or should parents treat their kids to plain water there as the healthier choice?
Inspecting the variables, we find 13 nutritional indicators: Portion(fl oz), Calories, Calories from fat, Total Fat(g), Saturated fat(g), Trans fat(g), Cholesterol(mg), Sodium(mg), Total Carbohydrate(g), Dietary Fiber(g), Sugars(g), Protein(g) and Caffeine(mg).
However, on closer inspection, Portion(fl oz) may not be an effective nutritional indicator, since it simply reflects the drink size ordered. As such, we will first check whether portion is correlated with the other nutritional indicators before moving on to building our data visualisation.
The last step is choosing the best kind of illustration to tell our data story on whether kids should avoid Starbucks drinks because of their poor nutritional value. Of the data visualisations covered in lesson 4, the choice is between a parallel coordinates plot and a heat map. Here, a heat map is the better choice because we can combine it with hierarchical clustering to cluster the drinks by their nutritional indicators.
As such, our task has two parts: (1) building a correlogram to determine the correlation between Portion and the other nutritional indicators, and (2) building a heat map of Starbucks drinks to show the level of each nutritional indicator in each drink.
We will be using the following packages:
packages = c('seriation', 'dendextend', 'heatmaply', 'corrplot', 'tidyverse', 'kableExtra')
# Install any package that is not yet available, then load it
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}
As mentioned earlier, we will be using the “starbucks_drink.csv” dataset to perform our task.
sb <- read_csv("data/starbucks_drink.csv")
kable(head(sb))
| Category | Name | Portion(fl oz) | Calories | Calories from fat | Total Fat(g) | Saturated fat(g) | Trans fat(g) | Cholesterol(mg) | Sodium(mg) | Total Carbohydrate(g) | Dietary Fiber(g) | Sugars(g) | Protein(g) | Caffeine(mg) | Size | Milk | Whipped Cream |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| iced-coffee | Cold Brew with Cascara Cold Foam | 12 | 50 | 0 | 0 | 0 | 0 | 0 | 25 | 11 | 0 | 11 | 1 | 145 | Tall | NA | NA |
| iced-coffee | Cold Brew with Cascara Cold Foam | 16 | 80 | 0 | 0 | 0 | 0 | 0 | 30 | 17 | 0 | 17 | 2 | 190 | Grande | NA | NA |
| iced-coffee | Cold Brew with Cascara Cold Foam | 24 | 100 | 0 | 0 | 0 | 0 | 0 | 40 | 22 | 0 | 22 | 2 | 280 | Venti Iced | NA | NA |
| iced-coffee | Cold Brew with Cascara Cold Foam | 30 | 130 | 0 | 0 | 0 | 0 | 0 | 45 | 28 | 0 | 28 | 2 | 320 | Trenta Iced | NA | NA |
| iced-coffee | Iced Coffee | 30 | 160 | 0 | 0 | 0 | 0 | 0 | 15 | 40 | 0 | 39 | 1 | 280 | Trenta Iced | NA | Sweetened |
| iced-coffee | Iced Coffee | 30 | 5 | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 1 | 330 | Trenta Iced | NA | Unsweetened |
For the purpose of this task, we are focused solely on kids’ drinks, so we will exclude all other rows from our dataset and keep only those in the kids-drinks-and-other category.
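The filtering step can be sketched as follows. This is a minimal sketch, assuming the category label is "kids-drinks-and-other" as it appears in the filtered rows; `sb_demo` and `kids_demo` are hypothetical names, and `sb_demo` is a small stand-in for `sb` built from a few rows shown in this post.

```r
library(dplyr)

# Small stand-in for `sb`, using a few rows from the tables in this post
sb_demo <- data.frame(
  Category = c("iced-coffee", "iced-coffee",
               "kids-drinks-and-other", "kids-drinks-and-other"),
  Name = c("Iced Coffee", "Cold Brew with Cascara Cold Foam",
           "Cinnamon Dolce Crème", "Cinnamon Dolce Crème"),
  Calories = c(160, 50, 140, 210),
  stringsAsFactors = FALSE
)

# Keep only the rows in the kids' drinks category
kids_demo <- sb_demo %>%
  filter(Category == "kids-drinks-and-other")

nrow(kids_demo)
```

Applied to the full `sb` data frame, the same `filter()` call would presumably produce the `kids_sb` object used in the rest of this post.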
| Category | Name | Portion(fl oz) | Calories | Calories from fat | Total Fat(g) | Saturated fat(g) | Trans fat(g) | Cholesterol(mg) | Sodium(mg) | Total Carbohydrate(g) | Dietary Fiber(g) | Sugars(g) | Protein(g) | Caffeine(mg) | Size | Milk | Whipped Cream |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| kids-drinks-and-other | Cinnamon Dolce Crème | 12 | 140 | 45 | 5 | 0 | 0 | 0 | 120 | 25 | 1 | 22 | 2 | 0 | Tall | Almond | No Whipped Cream |
| kids-drinks-and-other | Cinnamon Dolce Crème | 12 | 210 | 100 | 11 | 4 | 0 | 20 | 125 | 27 | 1 | 25 | 2 | 0 | Tall | Almond | Whipped Cream |
| kids-drinks-and-other | Cinnamon Dolce Crème | 12 | 170 | 50 | 6 | 5 | 0 | 0 | 130 | 28 | 0 | 26 | 1 | 0 | Tall | Coconut | No Whipped Cream |
| kids-drinks-and-other | Cinnamon Dolce Crème | 12 | 230 | 110 | 12 | 9 | 0 | 20 | 135 | 30 | 1 | 28 | 1 | 0 | Tall | Coconut | Whipped Cream |
| kids-drinks-and-other | Cinnamon Dolce Crème | 12 | 170 | 0 | 0 | 0 | 0 | 5 | 120 | 32 | 0 | 31 | 10 | 0 | Tall | Nonfat milk | No Whipped Cream |
| kids-drinks-and-other | Cinnamon Dolce Crème | 12 | 230 | 60 | 6 | 4 | 0 | 25 | 125 | 34 | 0 | 33 | 10 | 0 | Tall | Nonfat milk | Whipped Cream |
Inspecting the data, we notice that Caffeine(mg) is stored as a character column rather than a numeric one. As such, we will convert it to numeric format before moving on to further data analysis tasks.
kids_sb$`Caffeine(mg)` <- parse_number(kids_sb$`Caffeine(mg)`)
Now we can see that Caffeine(mg) has been classified in the correct format.
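One way to confirm the conversion is to compare the class of the column before and after parsing. The snippet below is a small sketch on hypothetical character values (`caffeine_raw` is not taken from the dataset); the same checks can be run on the Caffeine(mg) column of `kids_sb`.

```r
library(readr)

# Hypothetical character values standing in for the raw Caffeine(mg) column
caffeine_raw <- c("0", "15", "25")
class(caffeine_raw)

# parse_number() extracts the numeric value from each string
caffeine_num <- parse_number(caffeine_raw)
class(caffeine_num)
is.numeric(caffeine_num)
```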
With this, we can finally move on to our data visualisation tasks.
As mentioned earlier, we want to determine whether any nutritional indicators are strongly correlated with each other, and in particular whether Portion(fl oz) is highly correlated with any of the other nutritional indicators.
Before we begin, we need to select only the nutritional indicator columns (columns 3 to 15) for analysis.
kids_sb.cor <- cor(kids_sb[, 3:15])
After selecting the columns, we will use corrplot.mixed() from the corrplot package to plot the correlogram.
corrplot.mixed(kids_sb.cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
order="AOE",
tl.col = "black")
Because the correlation coefficients and text labels are too large for the plot, we will adjust their sizes using number.cex and tl.cex.
corrplot.mixed(kids_sb.cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
order="AOE",
tl.col = "black",
tl.cex = 0.6,
number.cex = 0.6)
Here are some interesting stats we observed from the correlogram:
Based on these observations, we will standardise the dataset by dividing each of the other nutritional indicators by portion size, so that drinks are compared per fluid ounce rather than per serving.
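To make the arithmetic concrete, here is a small worked example using the four sizes of Cold Brew with Cascara Cold Foam from the first table. It is only an illustration of the per-fluid-ounce calculation (total of an indicator divided by total portion); the variable names are ours.

```r
# Portion(fl oz) and Calories for the four sizes of
# "Cold Brew with Cascara Cold Foam" shown in the first table
portion  <- c(12, 16, 24, 30)
calories <- c(50, 80, 100, 130)

# Total calories over total volume: calories per fluid ounce across all sizes
cal_per_floz <- sum(calories) / sum(portion)

# For comparison, the unweighted mean of the per-size ratios
mean_of_ratios <- mean(calories / portion)

cal_per_floz
mean_of_ratios
```

The two figures differ slightly because dividing the totals gives larger sizes more weight than averaging the per-size ratios does.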
As mentioned in section 1.1 we will be building a heat map to show the relationship between kids’ drinks and nutritional indicators.
We first adjust the drinks name by combining the columns Name, Milk and Whipped Cream using the paste() function.
kids_sb$DrinkName = paste(kids_sb$Name,kids_sb$Milk, kids_sb$`Whipped Cream`)
We then collapse the dataset using group_by() and summarise(), dividing the total of each indicator by the total Portion(fl oz) for each drink.
kids_sb2 <- kids_sb %>%
group_by(`DrinkName`) %>%
summarise('Calories' = sum(`Calories`)/sum(`Portion(fl oz)`),
'Calories from fat' = sum(`Calories from fat`)/sum(`Portion(fl oz)`),
'Total Fat(g)' = sum(`Total Fat(g)`)/sum(`Portion(fl oz)`),
'Saturated fat(g)' = sum(`Saturated fat(g)`)/sum(`Portion(fl oz)`),
'Trans fat(g)' = sum(`Trans fat(g)`)/sum(`Portion(fl oz)`),
'Cholesterol(mg)' = sum(`Cholesterol(mg)`)/sum(`Portion(fl oz)`),
'Sodium(mg)' = sum(`Sodium(mg)`)/sum(`Portion(fl oz)`),
'Total Carbohydrate(g)' = sum(`Total Carbohydrate(g)`)/sum(`Portion(fl oz)`),
'Dietary Fiber(g)' = sum(`Dietary Fiber(g)`)/sum(`Portion(fl oz)`),
'Sugars(g)' = sum(`Sugars(g)`)/sum(`Portion(fl oz)`),
'Protein(g)' = sum(`Protein(g)`)/sum(`Portion(fl oz)`),
'Caffeine(mg)' = sum(`Caffeine(mg)`)/sum(`Portion(fl oz)`)) %>%
ungroup()
kable(head(kids_sb2))
| DrinkName | Calories | Calories from fat | Total Fat(g) | Saturated fat(g) | Trans fat(g) | Cholesterol(mg) | Sodium(mg) | Total Carbohydrate(g) | Dietary Fiber(g) | Sugars(g) | Protein(g) | Caffeine(mg) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cinnamon Dolce Crème 2% Milk No Whipped Cream | 17.70833 | 4.166667 | 0.4583333 | 0.2916667 | 0.0000000 | 1.979167 | 10.93750 | 2.645833 | 0.0000000 | 2.541667 | 0.7708333 | 0 |
| Cinnamon Dolce Crème 2% Milk Whipped Cream | 22.08333 | 7.916667 | 0.8750000 | 0.5416667 | 0.0208333 | 3.125000 | 11.45833 | 2.770833 | 0.0000000 | 2.708333 | 0.8125000 | 0 |
| Cinnamon Dolce Crème Almond No Whipped Cream | 11.87500 | 3.645833 | 0.3958333 | 0.0208333 | 0.0000000 | 0.000000 | 10.00000 | 2.083333 | 0.1041667 | 1.854167 | 0.1458333 | 0 |
| Cinnamon Dolce Crème Almond Whipped Cream | 16.25000 | 7.291667 | 0.8125000 | 0.2916667 | 0.0000000 | 1.250000 | 10.31250 | 2.229167 | 0.1041667 | 2.020833 | 0.1875000 | 0 |
| Cinnamon Dolce Crème Coconut No Whipped Cream | 13.95833 | 4.375000 | 0.5000000 | 0.4375000 | 0.0000000 | 0.000000 | 10.62500 | 2.333333 | 0.0416667 | 2.145833 | 0.0625000 | 0 |
| Cinnamon Dolce Crème Coconut Whipped Cream | 18.54167 | 8.125000 | 0.9166667 | 0.6875000 | 0.0000000 | 1.250000 | 10.93750 | 2.479167 | 0.0625000 | 2.312500 | 0.1041667 | 0 |
The collapsed dataset contains a total of 60 unique drinks; the table above shows the first six.
We then need to set the drink names as the row names before transforming the new dataset into a data matrix so we can build a heat map.
row.names(kids_sb2) <- kids_sb2$DrinkName
kids_sb_matrix <- data.matrix(kids_sb2)
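Setting row names directly on a tibble is deprecated in recent versions of the tibble package and raises a warning. A hedged alternative is to work on a plain data frame instead. The sketch below uses a hypothetical two-row stand-in, `kids_sb2_demo`, with values copied from the table above.

```r
# Two-row stand-in for kids_sb2 (values copied from the table above)
kids_sb2_demo <- data.frame(
  DrinkName = c("Cinnamon Dolce Crème Almond No Whipped Cream",
                "Cinnamon Dolce Crème Almond Whipped Cream"),
  Calories = c(11.875, 16.25),
  `Sugars(g)` = c(1.854167, 2.020833),
  check.names = FALSE,   # keep "Sugars(g)" as the column name
  stringsAsFactors = FALSE
)

# Move the drink names into the row names, then drop the name column
rownames(kids_sb2_demo) <- kids_sb2_demo$DrinkName
kids_demo_matrix <- data.matrix(kids_sb2_demo[, -1])

dim(kids_demo_matrix)
rownames(kids_demo_matrix)
```

With this variant the name column is already removed, so the `[, -c(1)]` subsetting used in the later heat map calls would not be needed.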
We will be building the heat map using heatmaply(). We will first build a test heat map with the default clusters before identifying the best number of clusters later.
heatmaply(normalize(kids_sb_matrix[, -c(1)]),
Colv=NA,
seriate = "none",
colors = Greens,
fontsize_row = 4,
fontsize_col = 5
)
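The normalize() call rescales each indicator to the 0 to 1 range so that columns measured in different units share one colour scale. A base R sketch of that min-max idea is shown below; `min_max` is a hypothetical helper written for illustration, not the heatmaply implementation, and it does not handle constant columns.

```r
# Min-max rescaling of one numeric vector to the [0, 1] range
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Calories per fl oz for the six drinks shown in the table above
calories <- c(17.70833, 22.08333, 11.87500, 16.25000, 13.95833, 18.54167)
calories_scaled <- min_max(calories)

calories_scaled
```

After rescaling, the drink with the lowest value maps to 0 and the one with the highest maps to 1, whatever the original units were.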
And now to make the heat map better, we will be identifying the best clustering method and the best number of clusters.
To find the best clustering method, we will be utilising dend_expend().
kids_sb_matrix2 <- dist(normalize(kids_sb_matrix[, -c(1)]), method = "euclidean")
dend_expend(kids_sb_matrix2)[[3]]
dist_methods hclust_methods optim
1 unknown ward.D 0.5614832
2 unknown ward.D2 0.6088735
3 unknown single 0.6646756
4 unknown complete 0.6243221
5 unknown average 0.7387914
6 unknown mcquitty 0.6958625
7 unknown median 0.5369151
8 unknown centroid 0.6061457
The output indicates that the ‘average’ method should be used since it has the highest optimum value.
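To give a feel for what such a score can look like, the sketch below computes a cophenetic correlation for each linkage method using base R only. We assume here that the optim column reports this kind of measure, which we understand to be the dendextend default but have not verified against its source. The built-in USArrests dataset is used purely for illustration, so the numbers will not match the output above.

```r
# Built-in dataset used purely for illustration, standardised column-wise
d <- dist(scale(USArrests), method = "euclidean")

methods <- c("ward.D", "ward.D2", "single", "complete",
             "average", "mcquitty", "median", "centroid")

# Correlation between the original distances and the tree-implied
# (cophenetic) distances, one score per linkage method
scores <- sapply(methods, function(m) {
  cor(as.vector(d), as.vector(cophenetic(hclust(d, method = m))))
})

sort(scores, decreasing = TRUE)
```

A higher score means the dendrogram preserves the original pairwise distances more faithfully.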
And to determine the best number of clusters, we will be using find_k().
kids_sb_cluster <- hclust(kids_sb_matrix2, method = "average")
kids_sb_k <- find_k(kids_sb_cluster)
plot(kids_sb_k)
From the figure above, we see that k = 10 is the optimal number of clusters.
With the best clustering method and clusters identified earlier, we will then replot the heat map while adding in more details such as the titles and labels.
heatmaply(normalize(kids_sb_matrix[,-c(1)]),
dist_method = "euclidean",
hclust_method = "average",
seriate = "none",
show_dendrogram = c(TRUE, FALSE),
k_row = 10,
colors = Greens,
margins = c(NA,200,60,NA),
fontsize_row = 4,
fontsize_col = 5,
xlab = "Nutritional Indicators",
ylab = "Drink Types",
main="Starbucks Kids' Drinks nutrition \nindicator by Drink Types",
Colv = NA
)
From the heat map, we can see that drinks containing Salted Caramel are among the highest in calories and also show high levels of sodium, cholesterol and carbohydrates. This suggests that salted caramel is the least healthy flavour among the kids’ drinks, and kids should avoid it if they can.
Hot Chocolate drinks also have high amounts of calories, total fat, cholesterol and sugars. This is further exacerbated when they are ordered in combination with Whipped Cream and Salted Caramel.
Whipped Cream also contributes to a higher calorie count because it adds total fat. As such, kids should avoid ordering whipped cream to reduce their calorie intake.
On further observation, drinks that contain any form of milk have more calories, driven by more saturated fat and therefore more total fat. Kids should be aware of this and try to avoid adding milk to their drinks.
Interestingly, Hot Chocolate and Pumpkin Spice drinks have the highest amounts of caffeine among the kids’ drinks. Kids should avoid these drinks if they can, or the caffeine may leave them restless for the rest of the day.
As for the healthier options, consumers should go for Crème drinks, which have a lower calorie count than the other drinks. Ordering them with no whipped cream and no milk gives the lowest calorie count.